Note: To view output with code, go to this page on Rpubs.
Principal investigator Dr. Vinca Monster of the Grape Program at State U needs me, a poor graduate student, to help her understand the influence of physico-chemical properties on wine preferences. Her laboratory has gathered an extensive dataset on Portuguese white varietals.
This report uses exploratory data analysis and linear regression to determine associations between wine properties and preference.
## Parsed with column specification:
## cols(
## `fixed acidity` = col_double(),
## `volatile acidity` = col_double(),
## `citric acid` = col_double(),
## `residual sugar` = col_double(),
## chlorides = col_double(),
## `free sulfur dioxide` = col_double(),
## `total sulfur dioxide` = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_integer()
## )
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
There doesn’t appear to be any unaccounted-for missing data.
Density, fixed acidity, and free sulfur dioxide seem like candidates for log transformation to improve their distributions (their slopes are definitely not zero).
Now I’ll try the logged variables. Those transformations don’t seem to help much visually, but I will try my model with and without them logged to check.
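As a minimal sketch of the transformation step, the logged columns could be created as below. The column names `density.log`, `fixed.acidity.log`, and `free.sulf.diox.log` match the ones that appear in the later output; the use of the natural log is an assumption, and the three rows are only the minimum, median, and maximum from the summary above, not real observations.

```r
# Illustrative values only: the minimum, median, and maximum of each column
# from the summary above. They are not actual rows of the dataset.
White_wines <- data.frame(
  density             = c(0.9871, 0.9937, 1.0390),
  fixed.acidity       = c(3.8, 6.8, 14.2),
  free.sulfur.dioxide = c(2, 34, 289)
)

# Assumption: natural log. log() is safe here because every value is > 0.
White_wines$density.log        <- log(White_wines$density)
White_wines$fixed.acidity.log  <- log(White_wines$fixed.acidity)
White_wines$free.sulf.diox.log <- log(White_wines$free.sulfur.dioxide)

print(White_wines)
```

Because density sits in a narrow band around 1, its log is close to zero across the whole range, which may be one reason the transformation changes little visually.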
Finally, I just want to check correlations to develop a better idea of what I’m putting into my model.
## quality alcohol chlorides citric.acid
## quality 1.000000000 0.43557472 -0.20993441 -0.009209091
## alcohol 0.435574715 1.00000000 -0.36018871 -0.075728730
## chlorides -0.209934411 -0.36018871 1.00000000 0.114364448
## citric.acid -0.009209091 -0.07572873 0.11436445 1.000000000
## density.log -0.307723788 -0.78135429 0.25757406 0.149442828
## fixed.acidity.log -0.109736681 -0.13108514 0.03305563 0.292566248
## free.sulf.diox.log 0.099058582 -0.22409197 0.09168564 0.084276395
## pH 0.099427246 0.12143210 -0.09043946 -0.163748211
## residual.sugar -0.097576829 -0.45063122 0.08868454 0.094211624
## sulphates 0.053677877 -0.01743277 0.01676288 0.062330940
## total.sulfur.dioxide -0.174737218 -0.44889210 0.19891030 0.121130798
## volatile.acidity -0.194722969 0.06771794 0.07051157 -0.149471811
## density.log fixed.acidity.log free.sulf.diox.log
## quality -0.30772379 -0.10973668 0.09905858
## alcohol -0.78135429 -0.13108514 -0.22409197
## chlorides 0.25757406 0.03305563 0.09168564
## citric.acid 0.14944283 0.29256625 0.08427640
## density.log 1.00000000 0.27695036 0.28317156
## fixed.acidity.log 0.27695036 1.00000000 -0.04534913
## free.sulf.diox.log 0.28317156 -0.04534913 1.00000000
## pH -0.09368819 -0.43478921 0.02199554
## residual.sugar 0.83864966 0.10237716 0.30293472
## sulphates 0.07444942 -0.01415546 0.06084248
## total.sulfur.dioxide 0.53044357 0.10259928 0.59619976
## volatile.acidity 0.02661505 -0.02974209 -0.11663198
## pH residual.sugar sulphates
## quality 0.099427246 -0.09757683 0.05367788
## alcohol 0.121432099 -0.45063122 -0.01743277
## chlorides -0.090439456 0.08868454 0.01676288
## citric.acid -0.163748211 0.09421162 0.06233094
## density.log -0.093688189 0.83864966 0.07444942
## fixed.acidity.log -0.434789207 0.10237716 -0.01415546
## free.sulf.diox.log 0.021995543 0.30293472 0.06084248
## pH 1.000000000 -0.19413345 0.15595150
## residual.sugar -0.194133454 1.00000000 -0.02666437
## sulphates 0.155951497 -0.02666437 1.00000000
## total.sulfur.dioxide 0.002320972 0.40143931 0.13456237
## volatile.acidity -0.031915368 0.06428606 -0.03572815
## total.sulfur.dioxide volatile.acidity
## quality -0.174737218 -0.19472297
## alcohol -0.448892102 0.06771794
## chlorides 0.198910300 0.07051157
## citric.acid 0.121130798 -0.14947181
## density.log 0.530443572 0.02661505
## fixed.acidity.log 0.102599278 -0.02974209
## free.sulf.diox.log 0.596199757 -0.11663198
## pH 0.002320972 -0.03191537
## residual.sugar 0.401439311 0.06428606
## sulphates 0.134562367 -0.03572815
## total.sulfur.dioxide 1.000000000 0.08926050
## volatile.acidity 0.089260504 1.00000000
The variable most correlated with quality is alcohol (r=.44), so I will use that as my primary independent variable. Logged density is highly correlated with alcohol (r=-.78) and residual sugar (r=.84), and logged free sulfur dioxide is somewhat less correlated with total sulfur dioxide (r=.60), so there may be some collinearity issues there.
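One way to scan a correlation matrix for collinearity candidates is to list every predictor pair whose absolute correlation exceeds a cutoff. The sketch below rebuilds a five-variable slice of the matrix from the rounded values printed above; the 0.5 cutoff is a hypothetical choice for illustration, not a rule taken from this analysis.

```r
vars <- c("alcohol", "density.log", "residual.sugar",
          "total.sulfur.dioxide", "free.sulf.diox.log")

# Symmetric matrix filled with the correlations printed above (rounded).
# upper.tri() fills column by column, so values are grouped per column.
r <- diag(length(vars))
dimnames(r) <- list(vars, vars)
r[upper.tri(r)] <- c(
  -0.7814,                          # density.log vs alcohol
  -0.4506,  0.8386,                 # residual.sugar vs alcohol, density.log
  -0.4489,  0.5304, 0.4014,         # total.sulfur.dioxide vs the first three
  -0.2241,  0.2832, 0.3029, 0.5962  # free.sulf.diox.log vs the first four
)
r[lower.tri(r)] <- t(r)[lower.tri(r)]

cutoff <- 0.5  # hypothetical threshold
idx <- which(abs(r) > cutoff & upper.tri(r), arr.ind = TRUE)
flagged <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx],
  row.names = NULL
)
print(flagged)
```

With this cutoff the list also picks up logged density and total sulfur dioxide (r=.53), in addition to the pairs discussed in the text.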
Below is my first regression, with all potential independent variables included.
##
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density.log +
## fixed.acidity.log + free.sulf.diox.log + pH + residual.sugar +
## sulphates + total.sulfur.dioxide + volatile.acidity, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4172 -0.5008 -0.0287 0.4585 3.0836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.249e+00 5.481e-01 -2.279 0.022726 *
## alcohol 1.976e-01 2.411e-02 8.199 3.07e-16 ***
## chlorides -4.037e-01 5.383e-01 -0.750 0.453354
## citric.acid 3.611e-03 9.443e-02 0.038 0.969494
## density.log -9.368e+01 1.311e+01 -7.144 1.04e-12 ***
## fixed.acidity.log 3.700e-01 9.982e-02 3.707 0.000212 ***
## free.sulf.diox.log 2.163e-01 1.780e-02 12.155 < 2e-16 ***
## pH 6.609e-01 1.052e-01 6.285 3.57e-10 ***
## residual.sugar 7.305e-02 7.478e-03 9.768 < 2e-16 ***
## sulphates 6.300e-01 9.898e-02 6.364 2.14e-10 ***
## total.sulfur.dioxide -1.901e-03 3.695e-04 -5.146 2.76e-07 ***
## volatile.acidity -1.651e+00 1.130e-01 -14.615 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7412 on 4886 degrees of freedom
## Multiple R-squared: 0.3012, Adjusted R-squared: 0.2996
## F-statistic: 191.4 on 11 and 4886 DF, p-value: < 2.2e-16
Interpretation: For each 1-unit increase in alcohol (I am guessing 1 percentage point of alcohol content), the predicted quality rating increases by 0.198, holding all other variables (the other properties of the wine) constant. For scale, the observed ratings run from 3 to 9. This is significant at p<.001.
This interpretation could be extended to any of the other independent variables. For example, a 1-unit increase in chlorides is associated with a .404 decrease in the rating of wine quality, all other independent variables held constant; however, with a p-value of .45, this association is not statistically significant.
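The "holding all other variables constant" reading can be illustrated with a small sketch on simulated data (not the wine dataset). The data frame `sim`, its two predictors, and the generating coefficients are all hypothetical; the point is only that the fitted slope equals the difference in predictions for two wines that differ by one unit of alcohol and nothing else.

```r
set.seed(1)  # simulated data, not the wine dataset
n <- 200
sim <- data.frame(
  alcohol          = runif(n, 8, 14),
  volatile.acidity = runif(n, 0.1, 1.0)
)
# Hypothetical "true" relationship used only to generate the example data
sim$quality <- 3 + 0.2 * sim$alcohol - 1.5 * sim$volatile.acidity +
  rnorm(n, sd = 0.5)

fit <- lm(quality ~ alcohol + volatile.acidity, data = sim)

# Two hypothetical wines that differ only by 1 unit of alcohol
wines <- data.frame(alcohol = c(10, 11), volatile.acidity = c(0.3, 0.3))
pred <- predict(fit, newdata = wines)

slope     <- unname(coef(fit)["alcohol"])
diff_pred <- unname(pred[2] - pred[1])
print(c(slope = slope, difference = diff_pred))
```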
##
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density +
## fixed.acidity + free.sulfur.dioxide + pH + residual.sugar +
## sulphates + total.sulfur.dioxide + volatile.acidity, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8348 -0.4934 -0.0379 0.4637 3.1143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.502e+02 1.880e+01 7.987 1.71e-15 ***
## alcohol 1.935e-01 2.422e-02 7.988 1.70e-15 ***
## chlorides -2.473e-01 5.465e-01 -0.452 0.65097
## citric.acid 2.209e-02 9.577e-02 0.231 0.81759
## density -1.503e+02 1.907e+01 -7.879 4.04e-15 ***
## fixed.acidity 6.552e-02 2.087e-02 3.139 0.00171 **
## free.sulfur.dioxide 3.733e-03 8.441e-04 4.422 9.99e-06 ***
## pH 6.863e-01 1.054e-01 6.513 8.10e-11 ***
## residual.sugar 8.148e-02 7.527e-03 10.825 < 2e-16 ***
## sulphates 6.315e-01 1.004e-01 6.291 3.44e-10 ***
## total.sulfur.dioxide -2.857e-04 3.781e-04 -0.756 0.44979
## volatile.acidity -1.863e+00 1.138e-01 -16.373 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared: 0.2819, Adjusted R-squared: 0.2803
## F-statistic: 174.3 on 11 and 4886 DF, p-value: < 2.2e-16
This is the same model as the first, but without density, fixed acidity, and free sulfur dioxide logged. The R-squared drops by about .02 (from .301 to .282), so I’ll keep those variables logged.
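Since both models have the same number of predictors, comparing raw R-squared is fair, but the adjusted values can also be checked by hand. The sketch below applies the standard adjusted R-squared formula to the figures reported in the two summaries; the helper name `adj_r2` is made up for this example.

```r
# Adjusted R-squared from R-squared, sample size, and residual df
adj_r2 <- function(r2, n, df_resid) 1 - (1 - r2) * (n - 1) / df_resid

n        <- 4898  # observations
df_resid <- 4886  # residual df reported for both 11-predictor models

logged   <- adj_r2(0.3012, n, df_resid)
unlogged <- adj_r2(0.2819, n, df_resid)
print(round(c(logged = logged, unlogged = unlogged), 4))
```

The results should land very close to the .2996 and .2803 printed in the summaries, with any small gap coming from rounding in the reported R-squared.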
##
## Call:
## lm(formula = quality ~ alcohol + density.log + fixed.acidity.log +
## free.sulf.diox.log + pH + residual.sugar + sulphates + total.sulfur.dioxide +
## volatile.acidity, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4167 -0.5008 -0.0281 0.4551 3.0856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.357701 0.528272 -2.570 0.0102 *
## alcohol 0.197790 0.023995 8.243 < 2e-16 ***
## density.log -95.276842 12.898858 -7.386 1.76e-13 ***
## fixed.acidity.log 0.381585 0.097972 3.895 9.96e-05 ***
## free.sulf.diox.log 0.215757 0.017773 12.139 < 2e-16 ***
## pH 0.673841 0.103283 6.524 7.53e-11 ***
## residual.sugar 0.074140 0.007321 10.126 < 2e-16 ***
## sulphates 0.632422 0.098843 6.398 1.72e-10 ***
## total.sulfur.dioxide -0.001903 0.000369 -5.158 2.60e-07 ***
## volatile.acidity -1.658880 0.110856 -14.964 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7411 on 4888 degrees of freedom
## Multiple R-squared: 0.3011, Adjusted R-squared: 0.2998
## F-statistic: 233.9 on 9 and 4888 DF, p-value: < 2.2e-16
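The model above (model 2) drops chlorides and citric acid, neither of which was significant in model 1. Below, I try model 2 without density, recalling that density is highly correlated with two other variables and has a high standard error in the previous models.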
##
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity.log + free.sulf.diox.log +
## pH + residual.sugar + sulphates + total.sulfur.dioxide +
## volatile.acidity, data = White_wines)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3023 -0.5071 -0.0281 0.4480 3.1178
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1519959 0.4067184 2.832 0.00464 **
## alcohol 0.3574313 0.0104813 34.102 < 2e-16 ***
## fixed.acidity.log -0.1357944 0.0688745 -1.972 0.04871 *
## free.sulf.diox.log 0.2379961 0.0176120 13.513 < 2e-16 ***
## pH 0.1981555 0.0811874 2.441 0.01469 *
## residual.sugar 0.0232764 0.0025007 9.308 < 2e-16 ***
## sulphates 0.4362305 0.0957271 4.557 5.31e-06 ***
## total.sulfur.dioxide -0.0024790 0.0003626 -6.837 9.09e-12 ***
## volatile.acidity -1.7508316 0.1107562 -15.808 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7451 on 4889 degrees of freedom
## Multiple R-squared: 0.2933, Adjusted R-squared: 0.2921
## F-statistic: 253.6 on 8 and 4889 DF, p-value: < 2.2e-16
Interpretation: With density dropped, there are no anomalous standard errors. R-squared decreases slightly (from .301 to .293), but not enough to be practically significant. This seems like a good candidate for the final model, but I’ll do diagnostics below.
| Dependent variable: quality | (1) | (2) | (3) |
| --- | --- | --- | --- |
| alcohol | 0.198*** | 0.198*** | 0.357*** |
| | (0.024) | (0.024) | (0.010) |
| chlorides | -0.404 | | |
| | (0.538) | | |
| citric.acid | 0.004 | | |
| | (0.094) | | |
| density.log | -93.679*** | -95.277*** | |
| | (13.113) | (12.899) | |
| fixed.acidity.log | 0.370*** | 0.382*** | -0.136** |
| | (0.100) | (0.098) | (0.069) |
| free.sulf.diox.log | 0.216*** | 0.216*** | 0.238*** |
| | (0.018) | (0.018) | (0.018) |
| pH | 0.661*** | 0.674*** | 0.198** |
| | (0.105) | (0.103) | (0.081) |
| residual.sugar | 0.073*** | 0.074*** | 0.023*** |
| | (0.007) | (0.007) | (0.003) |
| sulphates | 0.630*** | 0.632*** | 0.436*** |
| | (0.099) | (0.099) | (0.096) |
| total.sulfur.dioxide | -0.002*** | -0.002*** | -0.002*** |
| | (0.0004) | (0.0004) | (0.0004) |
| volatile.acidity | -1.651*** | -1.659*** | -1.751*** |
| | (0.113) | (0.111) | (0.111) |
| Constant | -1.249** | -1.358** | 1.152*** |
| | (0.548) | (0.528) | (0.407) |
| Observations | 4,898 | 4,898 | 4,898 |
| R2 | 0.301 | 0.301 | 0.293 |
| Adjusted R2 | 0.300 | 0.300 | 0.292 |
| Residual Std. Error | 0.741 (df = 4886) | 0.741 (df = 4888) | 0.745 (df = 4889) |
| F Statistic | 191.408*** (df = 11; 4886) | 233.949*** (df = 9; 4888) | 253.595*** (df = 8; 4889) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 | | |
## Test stat Pr(>|t|)
## alcohol 5.270 0.000
## chlorides 1.403 0.161
## citric.acid -4.424 0.000
## density.log 5.229 0.000
## fixed.acidity.log -3.103 0.002
## free.sulf.diox.log -11.193 0.000
## pH 0.964 0.335
## residual.sugar 2.481 0.013
## sulphates 0.047 0.963
## total.sulfur.dioxide -8.039 0.000
## volatile.acidity 3.210 0.001
## Tukey test 2.656 0.008
## Test stat Pr(>|t|)
## alcohol 5.203 0.000
## density.log 5.265 0.000
## fixed.acidity.log -3.020 0.003
## free.sulf.diox.log -11.170 0.000
## pH 0.959 0.338
## residual.sugar 2.507 0.012
## sulphates 0.063 0.950
## total.sulfur.dioxide -8.013 0.000
## volatile.acidity 3.228 0.001
## Tukey test 2.540 0.011
## Test stat Pr(>|t|)
## alcohol 5.269 0.000
## fixed.acidity.log -3.706 0.000
## free.sulf.diox.log -10.920 0.000
## pH 0.232 0.817
## residual.sugar -1.439 0.150
## sulphates 0.095 0.924
## total.sulfur.dioxide -7.631 0.000
## volatile.acidity 1.906 0.057
## Tukey test 0.942 0.346
The Tukey test for model 3 is not significant (p=.346), which suggests model 3 is a reasonable fit. Without reducing R-squared much, it seems clear the model can do without density, chlorides, and citric acid. All the other independent variables stay statistically significant.
Are there any influential outliers?
Below identifies influential observations.
Below identifies observations with large residuals.
## 3308 446 3811
## 1 2 3
Below identifies outliers.
## rstudent unadjusted p-value Bonferonni p
## 3308 -4.449019 8.818e-06 0.04319
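The Bonferroni-adjusted p-value in this output can be reproduced by multiplying the unadjusted p-value by the number of observations (capped at 1). This is a small arithmetic sketch using the figures printed above, with n taken from the 4,898 observations in the model.

```r
p_unadjusted <- 8.818e-06  # unadjusted p-value for case 3308, from the output
n_obs        <- 4898       # observations in the model

# Bonferroni correction: multiply by the number of tests, cap at 1
p_bonferroni <- min(1, n_obs * p_unadjusted)
print(signif(p_bonferroni, 4))
```

The adjustment accounts for the fact that the most extreme of 4,898 residuals is being tested, so case 3308 only just clears the .05 threshold.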
Below determines whether any points are highly influential.
NB. If there are points that are a) outliers AND b) highly influential, these have potential to change the inference. You should consider removing them.
To make sense of the plots above, I create an influence plot. Studentized residuals beyond +/-2 can be problematic.
## StudRes Hat CookD
## 446 -4.27799398 0.001869208 3.794668e-03
## 1418 -3.04787028 0.012061124 1.257976e-02
## 1932 -3.56197929 0.006915725 9.793883e-03
## 1952 -0.25148495 0.014763853 1.053232e-04
## 2782 -0.08782465 0.054472921 4.938388e-05
## 3308 -4.44901850 0.003946058 8.679611e-03
## 3811 -4.27258334 0.001088782 2.203041e-03
## 4746 -4.03477332 0.015318536 2.805189e-02
It looks like there are 5 cases that may be problematic, cases 741, 775, 3308, 3902, and 4746.
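For a quick numeric reading of the influence table, the rows can be screened against the +/-2 studentized-residual guideline and ranked by Cook's distance. The sketch below re-enters the printed table by hand; the data frame name `infl` is made up for this example. This simple cutoff is only one way to read the table and need not reproduce the five cases named above.

```r
# Values copied from the influence plot output above
infl <- data.frame(
  case    = c(446, 1418, 1932, 1952, 2782, 3308, 3811, 4746),
  StudRes = c(-4.27799398, -3.04787028, -3.56197929, -0.25148495,
              -0.08782465, -4.44901850, -4.27258334, -4.03477332),
  Hat     = c(0.001869208, 0.012061124, 0.006915725, 0.014763853,
              0.054472921, 0.003946058, 0.001088782, 0.015318536),
  CookD   = c(3.794668e-03, 1.257976e-02, 9.793883e-03, 1.053232e-04,
              4.938388e-05, 8.679611e-03, 2.203041e-03, 2.805189e-02)
)

# Flag rows whose studentized residual exceeds the +/-2 guideline
infl$large_resid <- abs(infl$StudRes) > 2
print(infl[infl$large_resid, c("case", "StudRes", "CookD")])

# Case with the largest Cook's distance among the listed rows
top_cook <- infl$case[which.max(infl$CookD)]
print(top_cook)
```

Cases 1952 and 2782 have small residuals but comparatively high hat values, so they are listed for leverage rather than for poor fit.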
Another diagnostic is to test for heteroscedasticity (i.e., whether the variance of the error term is not constant).
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 9.170889 Df = 1 p = 0.002458951
It does appear there is significant heteroscedasticity in the model.
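The p-value in the score test output is the upper-tail probability of a chi-square distribution with 1 degree of freedom, so it can be recovered from the printed statistic. A short sketch using base R's `pchisq`:

```r
chisq <- 9.170889  # test statistic printed above

# Upper-tail probability under a chi-square distribution with 1 df
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(p_value)
```

A small p-value here means the spread of the residuals changes with the fitted values, which is what the conclusion above rests on.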
Below, I try dropping cases 741, 775, 3308, 3902, and 4746, then run an influence plot and test for heteroscedasticity again.
We also want to look for multicollinearity, that is, whether some of our independent variables are highly correlated. We do this by looking at the Variance Inflation Factor (VIF). A GVIF > 4 suggests collinearity.
##
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity.log + free.sulf.diox.log +
## pH + residual.sugar + sulphates + total.sulfur.dioxide +
## volatile.acidity, data = Wwines_dropouts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1770 -0.5052 -0.0265 0.4454 2.6473
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.157086 0.404152 2.863 0.00421 **
## alcohol 0.360551 0.010415 34.618 < 2e-16 ***
## fixed.acidity.log -0.145213 0.068516 -2.119 0.03411 *
## free.sulf.diox.log 0.236185 0.017507 13.491 < 2e-16 ***
## pH 0.193139 0.080637 2.395 0.01665 *
## residual.sugar 0.022931 0.002485 9.227 < 2e-16 ***
## sulphates 0.424142 0.095052 4.462 8.29e-06 ***
## total.sulfur.dioxide -0.002328 0.000362 -6.432 1.38e-10 ***
## volatile.acidity -1.743506 0.110081 -15.838 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7397 on 4884 degrees of freedom
## Multiple R-squared: 0.2973, Adjusted R-squared: 0.2961
## F-statistic: 258.2 on 8 and 4884 DF, p-value: < 2.2e-16
## StudRes Hat CookD
## 254 -3.83791590 0.005005846 8.210813e-03
## 446 -4.30207872 0.001891299 3.882777e-03
## 1228 -4.27141241 0.002118194 4.288016e-03
## 1416 -3.11394252 0.012155346 1.323380e-02
## 1930 -3.62686985 0.007001829 1.028028e-02
## 1950 -0.26729449 0.014777809 1.190957e-04
## 2780 -0.07648524 0.054649518 3.758328e-05
## 3808 -4.30490087 0.001089529 2.237896e-03
## 4036 -0.99870812 0.014808194 1.665774e-03
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 7.88333 Df = 1 p = 0.004989254
At this point, I have logged some of the variables, dropped three variables (chlorides, citric acid, and density) without reducing R-squared much, compared 3 models, and dropped a few cases to try to reduce heteroscedasticity. Unfortunately, the non-constant variance test is still significant (p=.005), and there still appear to be some cases skewing the data. Unless there are multicollinearity issues, I think I will take my model to Dr. Monster to consult on further methods for manipulating the data and adjusting the model appropriately.
But first, I check on multicollinearity.
## alcohol fixed.acidity.log free.sulf.diox.log
## 1.467500 1.286266 1.694721
## pH residual.sugar sulphates
## 1.324745 1.421177 1.052046
## total.sulfur.dioxide volatile.acidity
## 2.087468 1.098414
No variables seem to be causing any problems.
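For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated data (not the wine dataset); the variables `x1`, `x2`, and `x3` are hypothetical, with `x2` deliberately built to correlate with `x1`.

```r
set.seed(2)  # simulated predictors, not the wine dataset
n  <- 300
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the others
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 3))
```

In this toy setup `x1` and `x2` should show clearly inflated values while `x3` stays near 1. All eight values reported above are below 2.1, well under the cutoff of 4.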
Controlling for the other independent variables, a 1-unit increase in alcohol is associated with a .36 increase in white wine ratings, statistically significant at p<.001 in the final model. To optimize ratings, adjusting the alcohol level higher seems like a good recommendation. Additionally, lower fixed acidity, higher free sulfur dioxides, a higher pH level, more residual sugars, higher sulphate levels, less total sulfur dioxide, and less volatile acidity all would likely contribute to higher ratings as they are statistically significant predictors. Theoretically, I would imagine wine qualities to hit saturation points where increasing or decreasing certain qualities will no longer have a positive outcome on wine ratings. I might discuss with Dr. Monster the possibility of a nonlinear model producing a better fit, or some other method for determining thresholds.
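One simple way to probe for a saturation point is to add a squared term and compare it against the linear fit. The sketch below does this on simulated data (not the wine dataset); the generating curve, noise level, and data frame `sim` are hypothetical, chosen so that the simulated rating peaks at an alcohol value of 12.

```r
# Simulated data with a built-in peak, purely for illustration
alcohol <- seq(8, 14, length.out = 200)
set.seed(3)
quality <- 2 + 1.2 * alcohol - 0.05 * alcohol^2 + rnorm(200, sd = 0.05)
sim <- data.frame(alcohol, quality)

linear    <- lm(quality ~ alcohol, data = sim)
quadratic <- lm(quality ~ alcohol + I(alcohol^2), data = sim)

# F-test of whether the squared term improves on the linear model
print(anova(linear, quadratic))

# Turning point of the fitted parabola: -b1 / (2 * b2)
b <- coef(quadratic)
turning_point <- unname(-b["alcohol"] / (2 * b["I(alcohol^2)"]))
print(turning_point)
```

A negative coefficient on the squared term indicates diminishing returns, and the turning point estimates where further increases would stop helping. Whether the real wine data behave this way is an open question for the consultation.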
The link to my Github account is https://github.com/craigalder. The link to my repository for this assignment is https://github.com/craigalder/N741gapminder1.git.